fix(parquet): write single file if option is set #17009
1c05032 to f1d58c5
I restarted the checks.
Sorry for the delay -- this PR seems to have quite a few conflicts.
00148b2 to 651cfb8
No worries! My bad, I forgot to sync before rebasing the branch onto main again. Thx!
Implicitly encoding single-file output by appending an extension to the path seems quite hacky; I'm not sure this is the correct approach to follow 🤔
let path = if file_type.get_ext() != DEFAULT_PARQUET_EXTENSION
    && options.single_file_output
{
    let mut path = path.to_owned();
    path.push_str(SINGLE_FILE_EXTENSION);
    path
} else {
    path.to_owned()
};
This part confuses me. If I'm not wrong, file_type.get_ext() != DEFAULT_PARQUET_EXTENSION
will always be true, because get_ext() returns the extension without the leading dot:
datafusion/datafusion/datasource-parquet/src/file_format.rs
Lines 157 to 162 in 602475f
impl GetExt for ParquetFormatFactory {
    fn get_ext(&self) -> String {
        // Removes the dot, i.e. ".parquet" -> "parquet"
        DEFAULT_PARQUET_EXTENSION[1..].to_string()
    }
}
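The point above can be reproduced in isolation. This is a minimal sketch, not DataFusion code: it assumes DEFAULT_PARQUET_EXTENSION is ".parquet" (as the comment in the quoted snippet implies), replaces the trait method with a free function, and adds a hypothetical differs_normalized helper to show one way the comparison could be made meaningful.

```rust
// Assumed value, based on the `".parquet" -> "parquet"` comment in the quoted snippet.
const DEFAULT_PARQUET_EXTENSION: &str = ".parquet";

/// Stand-in for `ParquetFormatFactory::get_ext`: drops the leading dot.
fn get_ext() -> String {
    DEFAULT_PARQUET_EXTENSION[1..].to_string()
}

/// The condition as written in the diff: a dot-less extension is compared
/// with the dotted constant, so the two sides can never be equal.
fn differs_as_written() -> bool {
    get_ext() != DEFAULT_PARQUET_EXTENSION
}

/// Hypothetical alternative: strip the leading dot on both sides before comparing.
fn differs_normalized(ext: &str) -> bool {
    ext.trim_start_matches('.') != DEFAULT_PARQUET_EXTENSION.trim_start_matches('.')
}
```

With these definitions, differs_as_written() is true for the Parquet format itself, while differs_normalized(&get_ext()) is false, which is presumably what the check was meant to express.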
let test_path = std::path::Path::new(&path);
assert!(
    test_path.is_dir(),
    "No extension and default DataFrameWriteOptons should have yielded a dir."
);
I think we would need to also check there are indeed multiple parquet files
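One way to express that stronger check, as a sketch using only the standard library: count_parquet_files is a hypothetical helper, not part of the PR, and it assumes the writer names its output files with a "parquet" extension directly inside the target directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Counts regular files directly inside `dir` whose extension is "parquet".
/// Hypothetical test helper; subdirectories are not traversed.
fn count_parquet_files(dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_parquet = path.extension().and_then(|ext| ext.to_str()) == Some("parquet");
        if path.is_file() && is_parquet {
            count += 1;
        }
    }
    Ok(count)
}
```

The test could then assert that count_parquet_files(test_path) is greater than 1 in addition to test_path.is_dir(), provided the input data is large enough to be split into several files.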
let test_path = std::path::Path::new(&path);
assert!(
    test_path.is_file(),
    "No extension and DataFrameWriteOptons::with_single_file_output(true) should have yielded a single file."
);
Similarly here we would need to check that only one file is written, not just a file is written
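A possible shape for that check, sketched under the assumption that the test writes into a fresh, otherwise empty directory: sole_file_in is a hypothetical helper, not part of the PR, that returns the output path only when it is the single entry in that directory and is a regular file.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the only entry in `dir` if it is a regular file, and `None` if the
/// directory is empty, holds more than one entry, or holds a subdirectory.
/// Hypothetical test helper.
fn sole_file_in(dir: &Path) -> io::Result<Option<PathBuf>> {
    let mut entries = fs::read_dir(dir)?;
    let first = match entries.next() {
        Some(entry) => entry?.path(),
        None => return Ok(None), // nothing was written
    };
    if entries.next().is_some() {
        return Ok(None); // more than one entry was written
    }
    Ok(if first.is_file() { Some(first) } else { None })
}
```

Asserting that sole_file_in(parent_dir) returns the expected path would rule out stray part files next to the single output file, which test_path.is_file() alone does not.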
#13323
Which issue does this PR close?
DataFrameWriteOptions::with_single_file_output produces a directory #13323

Rationale for this change
DF.write_parquet writes multiple files into a directory even if options.single_file_output is set.

What changes are included in this PR?
Introduce an internal .single extension.

Are these changes tested?
Yes, tests are part of this PR.

Are there any user-facing changes?
Not in this implementation. There might be if we decide to move to a FileSinkConfig-based solution.
Quoting: #13323 (comment)